1 Introduction



The prediction error of standard nonparametric regression methods may be critically affected by a linear transformation of the coordinate axes. This is typically the case for the popular $k$-nearest neighbor ($k$-NN) predictor (Fix and Hodges [11, 12], Cover and Hart [7], Cover [5, 6]), where a mere rescaling of the coordinate axes has a serious impact on the capabilities of the estimate. This is clearly an undesirable feature, especially in applications where the data measurements represent physically different quantities, such as temperature, blood pressure, cholesterol level, and the age of the patient. In this example, a simple change in, say, the unit of measurement of the temperature parameter will lead to totally different results, and will thus force the statistician to use a somewhat arbitrary preprocessing step prior to the $k$-NN estimation process. Furthermore, in several practical implementations, one would like, for physical or economic reasons, to supply the freshly collected data to some machine without preprocessing.

In this paper, we discuss a variation of the $k$-NN regression estimate whose definition is not affected by affine transformations of the coordinate axes. Such a modification could save the user a subjective preprocessing step and would save the manufacturer the trouble of adding input specifications.

The data set we have collected can be regarded as a collection of independent and identically distributed $\mathbb{R}^d \times \mathbb{R}$-valued random variables $\mathcal{D}_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$, independent of and with the same distribution as a generic pair $(X, Y)$ satisfying $E|Y| < \infty$. The space $\mathbb{R}^d$ is equipped with the standard Euclidean norm $\|\cdot\|$. For fixed $x \in \mathbb{R}^d$, our goal is to estimate the regression function $r(x) = E[Y \mid X = x]$ using the data $\mathcal{D}_n$. In this context, the usual $k$-NN regression estimate takes the form

$$r_n(x; \mathcal{D}_n) = \frac{1}{k_n} \sum_{i=1}^{k_n} Y_{(i)}(x),$$

where $(X_{(1)}(x), Y_{(1)}(x)), \ldots, (X_{(n)}(x), Y_{(n)}(x))$ is a reordering of the data according to increasing distances $\|X_i - x\|$ of the $X_i$'s to $x$. (If distance ties occur, a tie-breaking strategy must be defined. For example, if $\|X_i - x\| = \|X_j - x\|$, $X_i$ may be declared "closer" if $i < j$, i.e., the tie-breaking is done by indices.) For simplicity, we will suppress $\mathcal{D}_n$ in the notation and write $r_n(x)$ instead of $r_n(x; \mathcal{D}_n)$. Stone [37] showed that, for all $p \geq 1$, $E|r_n(X) - r(X)|^p \to 0$ for all possible distributions of $(X, Y)$ with $E|Y|^p < \infty$, whenever $k_n \to \infty$ and $k_n/n \to 0$ as $n \to \infty$. Thus, the $k$-NN estimate behaves asymptotically well, without exceptions. This property is called $L_p$ universal consistency.
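As an illustration of the definition above, here is a minimal sketch of the Euclidean $k$-NN regression estimate with tie-breaking by indices. The function name `knn_regress` and the toy data are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def knn_regress(X, Y, x, k):
    """Euclidean k-NN regression estimate: mean of Y over the k nearest X_i to x."""
    dist = np.linalg.norm(X - x, axis=1)
    # A stable sort keeps equal distances in index order,
    # so a tie is resolved in favor of the smaller index.
    order = np.argsort(dist, kind="stable")
    return float(np.mean(Y[order[:k]]))

# Toy sample in dimension 1 (hypothetical values).
X = np.array([[0.0], [2.0], [-1.0], [1.0]])
Y = np.array([10.0, 20.0, 30.0, 40.0])

# The third and fourth observations are both at distance 1 from x = 0;
# the third one is declared closer because its index is smaller.
estimate = knn_regress(X, Y, np.array([0.0]), k=2)
print(estimate)
```

With $k = 2$ the estimate averages the responses of the first and third observations, which is the behavior prescribed by the index-based tie-breaking rule.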




Clearly, any affine transformation of the coordinate axes influences the $k$-NN estimate through the norm $\|\cdot\|$, thereby illuminating an unpleasant face of the procedure. To illustrate this remark, assume that a nontrivial affine transformation $T : z \mapsto Az + b$ (that is, a nonsingular linear transformation $A$ followed by a translation $b$) is applied to both $x$ and $X_1, \ldots, X_n$. Examples include any number of combinations of rotations, translations, and linear rescalings. Denote by $\mathcal{D}'_n = \{(T(X_1), Y_1), \ldots, (T(X_n), Y_n)\}$ the transformed sample. Then, for such a function $T$, one has $r_n(x; \mathcal{D}_n) \neq r_n(T(x); \mathcal{D}'_n)$ in general, whereas $r(X) = E[Y \mid T(X)]$ since $T$ is bijective. Thus, to continue our discussion, we are looking in essence for a regression estimate $r_n$ with the following property:

$$r_n(x; \mathcal{D}_n) = r_n(T(x); \mathcal{D}'_n). \tag{1.1}$$

We call $r_n$ affine invariant. Affine invariance is indeed a very strong but highly desirable property. In $\mathbb{R}^d$, in the context of $k$-NN estimates, it suffices to be able to define an affine invariant distance measure, which is necessarily data-dependent. With this objective in mind, we develop in the next section an estimation procedure featuring (1.1) which in form coincides with the $k$-NN estimate, and establish its consistency in Section 3. Proofs of the most technical results are gathered in Section 4.
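The lack of affine invariance of the Euclidean estimate can be seen on a two-point toy example. The sketch below is only an illustration under assumed data: the helper `nearest_response`, the matrix `A`, and the offset `b` are hypothetical choices, with `A` playing the role of a change of unit on the second coordinate.

```python
import numpy as np

def nearest_response(X, Y, x):
    """Response of the Euclidean nearest neighbor of x (the k = 1 estimate)."""
    return float(Y[np.argmin(np.linalg.norm(X - x, axis=1))])

# Two observations and a query point at the origin (hypothetical values).
X = np.array([[1.0, 0.0], [0.0, 2.0]])
Y = np.array([1.0, 0.0])
x = np.array([0.0, 0.0])

A = np.diag([1.0, 0.1])      # rescaling of the second coordinate only
b = np.array([5.0, -3.0])    # arbitrary translation

def T(z):
    """Affine map z -> Az + b, applied to a point or to each row of a sample."""
    return z @ A.T + b

before = nearest_response(X, Y, x)          # nearest neighbor is the first point
after = nearest_response(T(X), Y, T(x))     # nearest neighbor is now the second point
print(before, after)
```

The translation alone would change nothing; it is the rescaling of one axis that swaps the nearest neighbor and hence the prediction.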

It should be stressed that what we are after in this article is an estimate of $r$ which is invariant under an affine transformation of both the query point $x$ and the original regressors $X_1, \ldots, X_n$. When only the regressors are subject to such a transformation, it is more natural to talk of "affine equivariant" regression estimates rather than of "affine invariant" ones; this is more in line with the terminology used, for example, in Ollila, Hettmansperger, and Oja [28] and Ollila, Oja, and Koivunen [29]. These affine invariance and affine equivariance requirements, however, are strictly equivalent.

There have been many attempts in the nonparametric literature to achieve affine invariance. One of the most natural ones relates to the so-called transformation-retransformation proposed by Chakraborty, Chaudhuri, and Oja [3]. That method and many variants have been discussed in texts such as [9] and [17] for pattern recognition and regression, respectively, but they have also been used in kernel density estimation (see, e.g., Samanta [36]). It is worth noting that, computational issues aside, the transformation step (i.e., premultiplication of the regressors by $M_n^{-1}$, where $M_n$ is an affine equivariant scatter estimate) may be based on a statistic $M_n$ that does not require finiteness of any moment. A typical example is the scatter estimate proposed in Tyler [38] or Hettmansperger and Randles [20]. Rather, our procedure takes ideas from the classical nonparametric literature using concepts such as multivariate ranks. It is close in spirit to the approach of Paindaveine and Van Bever [31], who introduce a class of depth-based classification procedures that are of a nearest neighbor nature.

There are also attempts at getting invariance to other transformations. The most important concept here is that of invariance under monotone transformations of the coordinate axes. In particular, any strategy that uses only the coordinatewise ranks of the $X_i$'s achieves this. The onus, then, is to show consistency of the methods under the most general conditions possible. For example, using an $L_p$ norm on the $d$-vectors of differences between ranks, one can show that the classical $k$-NN regression function estimate is universally consistent in the sense of Stone [37]. This was observed by Olshen [30], and shown by Devroye [8] (see also Gordon and Olshen [15, 16], Devroye and Krzyżak [10], and Biau and Devroye [2] for related works). Rules based upon statistically equivalent blocks (see, e.g., Anderson [1], Quesenberry and Gessaman [34], Gessaman [13], Gessaman and Gessaman [14], and Devroye, Györfi, and Lugosi [9, Section 21.4]) are other important examples of regression methods invariant with respect to monotone transformations of the coordinate axes. These methods and their generalizations partition the space with sets that contain a fixed number of data points each.
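To make the remark on coordinatewise ranks concrete, the following sketch computes the ranks column by column and checks that strictly increasing transformations of each axis leave them unchanged. The function name `coordinatewise_ranks` and the sample values are assumptions for illustration; ties are assumed absent, as is the case with probability 1 when $X$ has a density.

```python
import numpy as np

def coordinatewise_ranks(X):
    """0-based rank of each entry within its own column (assumes no ties)."""
    # argsort of argsort turns each column into the ranks of its entries.
    return np.argsort(np.argsort(X, axis=0), axis=0)

X = np.array([[0.5, 10.0],
              [2.0, -1.0],
              [1.0, 3.0]])
R = coordinatewise_ranks(X)

# Apply a different strictly increasing map to each coordinate.
X_transformed = np.column_stack([np.exp(X[:, 0]), X[:, 1] ** 3])
R_transformed = coordinatewise_ranks(X_transformed)
print(R.tolist(), np.array_equal(R, R_transformed))
```

Any rule that depends on the sample only through `R` is therefore unaffected by such monotone changes of the coordinate axes.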

It would be interesting to consider in a future paper the possibility of morph- 
ing the input space in more general ways than those suggested in the previous 
few paragraphs of the present article. It should be possible, in principle, to 
define appropriate metrics to obtain invariance for interesting large classes 
of nonlinear transformations, and show consistent asymptotic behaviors. 

2 An affine invariant $k$-NN estimate

The $k$-NN estimate we are discussing is based upon the notion of empirical distance. Throughout, we assume that the distribution of $X$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}^d$ and that $n \geq d$. Because of this density assumption, any collection $X_{i_1}, \ldots, X_{i_d}$ ($1 \leq i_1 < i_2 < \cdots < i_d \leq n$) of $d$ points among $X_1, \ldots, X_n$ is in general position with probability 1. Consequently, there exists with probability 1 a unique hyperplane in $\mathbb{R}^d$ containing these $d$ random points, and we denote it by $\mathcal{H}(X_{i_1}, \ldots, X_{i_d})$.

With this notation, the empirical distance between $d$-vectors $x$ and $x'$ is defined as

$$\rho_n(x, x') = \sum_{1 \leq i_1 < \cdots < i_d \leq n} \mathbf{1}_{\{\text{segment } (x, x') \text{ intersects the hyperplane } \mathcal{H}(X_{i_1}, \ldots, X_{i_d})\}}.$$

Put differently, $\rho_n(x, x')$ just counts the number of hyperplanes in $\mathbb{R}^d$ passing through $d$ out of the points $X_1, \ldots, X_n$ that separate $x$ and $x'$. Roughly, "near" points have fewer intersections; see Figure 1, which depicts an example in dimension 2.




Figure 1: An example in dimension 2. The empirical distance between x and 
x' is 4. (Note that the hyperplane defined by the pair (3, 5) indeed cuts the 
segment (x, x'), so that the distance is 4, not 3.) 
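The counting definition of the empirical distance can be sketched by brute force. In the sketch below, the names `side` and `empirical_distance` are assumptions for illustration; the side of a hyperplane on which a point lies is read off the sign of a determinant, and the sample is assumed to be in general position so that no determinant vanishes.

```python
import itertools
import numpy as np

def side(points, z):
    """Sign telling on which side of the hyperplane through `points`
    (d points of R^d, one per row) the point z lies."""
    base = points[0]
    return np.sign(np.linalg.det(np.vstack([points[1:] - base, z - base])))

def empirical_distance(X, x, x_prime):
    """Number of hyperplanes through d sample points that separate x and x_prime."""
    n, d = X.shape
    count = 0
    for idx in itertools.combinations(range(n), d):
        pts = X[list(idx)]
        # Opposite signs mean the segment (x, x_prime) crosses the hyperplane.
        if side(pts, x) * side(pts, x_prime) < 0:
            count += 1
    return count

# Three sample points in the plane (hypothetical values): three lines in total.
X = np.array([[0.0, 1.0], [0.0, -1.0], [5.0, 5.0]])
x, x_prime = np.array([-1.0, 0.0]), np.array([1.0, 0.0])
rho = empirical_distance(X, x, x_prime)

# The same count is obtained after an affine map z -> Az + b with A nonsingular.
A, b = np.array([[2.0, 1.0], [0.5, 3.0]]), np.array([1.0, -4.0])
rho_mapped = empirical_distance(X @ A.T + b, A @ x + b, A @ x_prime + b)
print(rho, rho_mapped)
```

The loop visits all $\binom{n}{d}$ subsets, so this direct evaluation is only practical for small $n$ and $d$.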

This hyperplane-based concept of distance is known in the multivariate rank tests literature as the empirical lift-interdirection function (Oja and Paindaveine [27], see also Randles [35], Oja [26], and Hallin and Paindaveine [18] for companion concepts). It was originally mentioned (but not analyzed) in Hettmansperger, Möttönen, and Oja [19], and independently suggested as an affine invariant alternative to ordinary metrics in the monograph of Devroye, Györfi, and Lugosi [9, Section 11.6]. We speak throughout of distance even though, for a fixed sample of size $n$, $\rho_n$ is only defined with probability 1 and is not a distance measure stricto sensu (in particular, $\rho_n(x, x') = 0$ does not imply that $x = x'$). Nevertheless, this empirical distance is invariant under affine transformations $x \mapsto Ax + b$, where $A$ is some arbitrary nonsingular linear map and $b$ any offset vector (see, for instance, Oja and Paindaveine [27, Section 2.4]).
Now, fix $x \in \mathbb{R}^d$ and let $\rho_n(x, X_i)$ be the empirical distance between $x$ and some observation $X_i$ in the sample $X_1, \ldots, X_n$ (that is, the number of hyperplanes in $\mathbb{R}^d$ passing through $d$ out of the observations $X_1, \ldots, X_n$ that cut the segment $(x, X_i)$). In this context, the $k$-NN estimate we are considering still takes the familiar form

$$r_n(x) = \frac{1}{k_n} \sum_{i=1}^{k_n} Y_{(i)}(x),$$

with the important difference that now the data set $(X_1, Y_1), \ldots, (X_n, Y_n)$ is reordered according to increasing values of the empirical distances $\rho_n(x, X_i)$, not the original Euclidean metric. By construction, the estimate $r_n$ has the desired affine invariance property and, moreover, it coincides with the standard (Euclidean) estimate in dimension $d = 1$. In the next section, we prove the following theorem. The distribution of the random variable $X$ is denoted by $\mu$.
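A brute-force version of this estimate can be sketched as follows. The names `_side` and `affine_invariant_knn` are assumptions for this example. Two choices below are assumptions of the sketch, not statements from the text: ties between empirical distances are broken by indices, and hyperplanes passing through $X_i$ itself are skipped when computing $\rho_n(x, X_i)$ (each observation lies on the same number of such hyperplanes, so whichever convention is used for them shifts every distance by the same amount and does not affect the ordering).

```python
import itertools
import numpy as np

def _side(points, z):
    """Sign of the side of the hyperplane through `points` on which z lies."""
    base = points[0]
    return np.sign(np.linalg.det(np.vstack([points[1:] - base, z - base])))

def affine_invariant_knn(X, Y, x, k):
    """k-NN regression estimate with neighbors ranked by the empirical distance."""
    n, d = X.shape
    rho = np.zeros(n, dtype=int)
    for idx in itertools.combinations(range(n), d):
        pts = X[list(idx)]
        s_x = _side(pts, x)
        for i in range(n):
            if i in idx:
                continue  # X_i lies on this hyperplane; skipped (see lead-in)
            if s_x * _side(pts, X[i]) < 0:
                rho[i] += 1  # this hyperplane separates x from X_i
    # Stable sort: equal empirical distances are ordered by index (assumption).
    order = np.argsort(rho, kind="stable")
    return float(np.mean(Y[order[:k]]))

# Small random sample in the plane (hypothetical data) and an affine map.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))
Y = rng.normal(size=8)
x = np.array([0.1, -0.2])
A, b = np.array([[2.0, 1.0], [0.5, 3.0]]), np.array([1.0, -4.0])

original = affine_invariant_knn(X, Y, x, k=3)
mapped = affine_invariant_knn(X @ A.T + b, Y, A @ x + b, k=3)
print(original, mapped)
```

Since the separation pattern of the hyperplanes is preserved by the affine map, both calls rank the observations identically and return the same value.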

Theorem 2.1 (Pointwise $L_p$ consistency) Assume that $X$ has a probability density, that $Y$ is bounded, and that the regression function $r$ is $\mu$-almost surely continuous. Then, for $\mu$-almost all $x \in \mathbb{R}^d$ and all $p \geq 1$, if $k_n \to \infty$ and $k_n/n \to 0$,

$$E|r_n(x) - r(x)|^p \to 0 \quad \text{as } n \to \infty.$$

The following corollary is a consequence of Theorem 2.1 and the Lebesgue dominated convergence theorem.

Corollary 2.1 (Global $L_p$ consistency) Assume that $X$ has a probability density, that $Y$ is bounded, and that the regression function $r$ is $\mu$-almost surely continuous. Then, for all $p \geq 1$, if $k_n \to \infty$ and $k_n/n \to 0$,

$$E|r_n(X) - r(X)|^p \to 0 \quad \text{as } n \to \infty.$$

The conditions of Stone's universal consistency theorem given in [37] are not fulfilled for our estimate. For the standard nearest neighbor estimate, a key result used in the consistency proof by Stone is that a given data point cannot be the nearest neighbor of more than a constant number (say, $3^d$) of other points. Such a universal constant does not exist after our transformation is applied. That means that a single data point can have a large influence on the regression function estimate. While this by itself does not imply that the estimate is not universally consistent, it certainly indicates that any such proof will require new insights. The addition of two smoothness constraints, namely that $X$ has a density (without, however, imposing any continuity conditions on the density itself) and that $r$ is $\mu$-almost surely continuous, is sufficient.

The complexity of our procedure in terms of sample size $n$ and dimension $d$ is quite high. There are $\binom{n}{d}$ possible choices of hyperplanes through $d$ points. This collection of hyperplanes defines an arrangement, or partition of $\mathbb{R}^d$ into polytopal regions, also called cells or chambers. Within each region, the distance to each data point is constant, and thus, a preprocessing step might consist of setting up a data structure for determining to which cell a given point $x \in \mathbb{R}^d$ belongs: this is called the point location problem. Meiser [24] showed that such a data structure exists with the following properties: (1) it takes space $O(n^{d+\varepsilon})$ for any fixed $\varepsilon > 0$, and (2) point location can be performed in $O(\log n)$ time. Chazelle's cuttings [4] improve (1) to $O(n^d)$. Chazelle's processing time for setting up the data structure is $O(n^d)$. Still in the preprocessing step, one can determine for each cell in the arrangement the distances to all $n$ data points: this can be done by walking across the graph of cells or by brute force. When done naively, the overall set-up complexity is $O(n^{2d+1})$. For each cell, one might keep a pointer to the $k$ nearest neighbors. Therefore, once set up, the computation of the regression function estimate takes merely $O(\log n)$ time for point location, and $O(k)$ time for retrieving the $k$ nearest neighbors.
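To give a feeling for the orders of magnitude involved, the number of hyperplanes determined by $d$-subsets of the sample can be tabulated directly. The function name `num_hyperplanes` and the chosen values of $n$ and $d$ are assumptions made for this illustration.

```python
from math import comb

def num_hyperplanes(n, d):
    """Number of d-subsets of n points, i.e. of hyperplanes through d sample points
    when the sample is in general position."""
    return comb(n, d)

counts = {(n, d): num_hyperplanes(n, d) for n, d in [(100, 2), (100, 3), (1000, 3)]}
for (n, d), c in counts.items():
    print(f"n={n}, d={d}: {c} hyperplanes")
```

Already for a thousand points in dimension 3 the count exceeds $10^8$, which is why evaluating every distance by exhaustive enumeration quickly becomes expensive.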

One could envisage a reduction in the complexity by defining the distances 
not in terms of all hyperplanes that cut a line segment, but in terms of 
the number of randomly drawn hyperplanes that make such a cut, where 
the number of random draws is now a carefully selected number. By the 
concentration of binomial random variables, such random estimates of the 
distances are expected to work well, while keeping the complexity reasonable. 
This idea will be explored elsewhere. 
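The random-hyperplane idea can be sketched as a simple Monte Carlo count. This is only one possible reading of the suggestion above, not the procedure the authors have in mind: the names `side` and `sampled_distance_fraction`, the uniform sampling of $d$-subsets, and the number of draws `m` are all assumptions of this sketch.

```python
import numpy as np

def side(points, z):
    """Sign of the side of the hyperplane through `points` on which z lies."""
    base = points[0]
    return np.sign(np.linalg.det(np.vstack([points[1:] - base, z - base])))

def sampled_distance_fraction(X, x, x_prime, m, rng):
    """Fraction of m randomly drawn hyperplanes (each through d distinct sample
    points chosen uniformly at random) that separate x and x_prime."""
    n, d = X.shape
    hits = 0
    for _ in range(m):
        idx = rng.choice(n, size=d, replace=False)
        if side(X[idx], x) * side(X[idx], x_prime) < 0:
            hits += 1
    return hits / m

rng = np.random.default_rng(1)
X = np.array([[1.0], [2.0], [3.0], [4.0]])   # dimension 1: hyperplanes are points
far = sampled_distance_fraction(X, np.array([0.0]), np.array([10.0]), 50, rng)
same = sampled_distance_fraction(X, np.array([0.0]), np.array([0.0]), 50, rng)
print(far, same)
```

Multiplying the returned fraction by $\binom{n}{d}$ gives an estimate of the empirical distance; for ranking neighbors the common factor can be dropped.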

3 Proof of the theorem 

Recall, since $X$ has a probability density with respect to the Lebesgue measure on $\mathbb{R}^d$, that any collection $X_{i_1}, \ldots, X_{i_d}$ ($1 \leq i_1 < i_2 < \cdots < i_d \leq n$) of $d$ points among $X_1, \ldots, X_n$ defines with probability 1 a unique hyperplane $\mathcal{H}(X_{i_1}, \ldots, X_{i_d})$ in $\mathbb{R}^d$. Thus, in the sequel, since no confusion is possible, we will freely refer to "the hyperplane $\mathcal{H}(X_{i_1}, \ldots, X_{i_d})$ defined by $X_{i_1}, \ldots, X_{i_d}$" without further explicit mention of the probability 1 event.

Let us first fix some useful notation. The distribution of the random variable $X$ is denoted by $\mu$ and its density with respect to the Lebesgue measure is denoted by $f$. For every $\varepsilon > 0$, we let $\mathcal{B}_{x,\varepsilon} = \{y \in \mathbb{R}^d : \|y - x\| \leq \varepsilon\}$ be the closed Euclidean ball with center at $x$ and radius $\varepsilon$. We write $A^c$ for the complement of a subset $A$ of $\mathbb{R}^d$. For two random variables $Z_1$ and $Z_2$, the notation

$$Z_1 \leq_{st} Z_2$$

means that $Z_1$ is stochastically dominated by $Z_2$, that is, for all $t \in \mathbb{R}$,

$$P\{Z_1 > t\} \leq P\{Z_2 > t\}.$$

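Our first goal is to show that for $\mu$-almost all $x$, as $k_n/n \to 0$, the quantity $\max_{i=1,\ldots,k_n} \|X_{(i)}(x) - x\|$ converges to 0 in probability, i.e., for every $\varepsilon > 0$,

$$\lim_{n \to \infty} P\Big\{\max_{i=1,\ldots,k_n} \|X_{(i)}(x) - x\| > \varepsilon\Big\} = 0. \tag{3.1}$$

So, fix such a positive $\varepsilon$. Let $\delta$ be a real number in $(0, \varepsilon)$ and $\gamma_n$ be a positive real number (possibly a function of $x$ and $\varepsilon$) to be determined later. To prove identity (3.1), we use the following decomposition, which is valid for all $x \in \mathbb{R}^d$:

$$
\begin{aligned}
P\Big\{\max_{i=1,\ldots,k_n} \|X_{(i)}(x) - x\| > \varepsilon\Big\}
&\leq P\Big\{\min_{i=1,\ldots,n \,:\, X_i \in \mathcal{B}^c_{x,\varepsilon}} \rho_n(x, X_i) < \gamma_n\Big\} \\
&\quad + P\Big\{\max_{i=1,\ldots,n \,:\, X_i \in \mathcal{B}_{x,\delta}} \rho_n(x, X_i) \geq \gamma_n\Big\} \\
&\quad + P\big\{\mathrm{Card}\{i = 1, \ldots, n : \|X_i - x\| \leq \delta\} < k_n\big\} \\
&:= A + B + C.
\end{aligned} \tag{3.2}
$$

Indeed, if none of the three events occurs, then at least $k_n$ observations lie in $\mathcal{B}_{x,\delta}$, and each of them is strictly closer to $x$ in empirical distance than every observation outside $\mathcal{B}_{x,\varepsilon}$, so that the $k_n$ nearest neighbors all lie in $\mathcal{B}_{x,\varepsilon}$. The convergence to 0 of each of the three terms above, from which identity (3.1) immediately follows, is analyzed separately in the next three paragraphs.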



Analysis of A. As for now, taking an affine geometry point of view, we keep $x$ fixed and see it as the origin of the space. Recall that each point in the Euclidean space $\mathbb{R}^d$ (with the origin at $x$) may be described by its hyperspherical coordinates (see, e.g., Miller [25, Chapter 1]), which consist of a nonnegative radial coordinate $r$ and $d - 1$ angular coordinates $\theta_1, \ldots, \theta_{d-1}$, where $\theta_{d-1}$ ranges over $[0, 2\pi)$ and the other angles range over $[0, \pi]$ (adaptation of this definition to the cases $d = 1$ and $d = 2$ is clear). For a $(d-1)$-dimensional vector $\Theta = (\theta_1, \ldots, \theta_{d-1})$ of hyperspherical angles, we let $\mathcal{B}_{x,\varepsilon}(\Theta)$ be the unique closed ball anchored at $x$ in the direction $\Theta$ and with diameter $\varepsilon$ (see Figure 2, which depicts an illustration in dimension 2). We also let $\mathcal{L}_x(\Theta)$ be the axis defined by $x$ and the direction $\Theta$, and let as well $\mathcal{S}_{x,\varepsilon}(\Theta)$ be the open segment obtained as the intersection of $\mathcal{L}_x(\Theta)$ and the interior of $\mathcal{B}_{x,\varepsilon}(\Theta)$.

Figure 2: The ball $\mathcal{B}_{x,\varepsilon}(\Theta)$ and related notation. Illustration in dimension 2.

Next, for fixed $x$, $\varepsilon$ and $\Theta$, we split the ball $\mathcal{B}_{x,\varepsilon}(\Theta)$ into $2^{d-1}$ disjoint regions $\mathcal{R}^1_{x,\varepsilon}(\Theta), \ldots, \mathcal{R}^{2^{d-1}}_{x,\varepsilon}(\Theta)$ as follows. First, the Euclidean space $\mathbb{R}^d$ is sequentially divided into $2^{d-1}$ symmetric quadrants rotating around the axis $\mathcal{L}_x(\Theta)$ (boundary equalities are broken arbitrarily). Next, each region $\mathcal{R}^j_{x,\varepsilon}(\Theta)$ is obtained as the intersection of one of the $2^{d-1}$ quadrants and the ball $\mathcal{B}_{x,\varepsilon}(\Theta)$.

The numbers of sample points falling in each of these regions are denoted hereafter by $N^1_{x,\varepsilon}(\Theta), \ldots, N^{2^{d-1}}_{x,\varepsilon}(\Theta)$ (see Figure 2). Letting finally $V_d$ be the volume of the unit $d$-dimensional Euclidean ball, we are now in a position to control the first term of inequality (3.2).

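Proposition 3.1 For $\mu$-almost all $x \in \mathbb{R}^d$ and all $\varepsilon > 0$ small enough,

$$P\Big\{\min_{i=1,\ldots,n \,:\, X_i \in \mathcal{B}^c_{x,\varepsilon}} \rho_n(x, X_i) < \gamma_n\Big\} \to 0 \quad \text{as } n \to \infty,$$

provided

$$\gamma_n = n^d \left(\frac{V_d}{2^{2d+1}}\, \varepsilon^d f(x)\right)^{2^{d-1}}.$$

Proof of Proposition 3.1 Set

$$p_{x,\varepsilon} = \min_{j=1,\ldots,2^{d-1}} \inf_{\Theta} \mu\{\mathcal{R}^j_{x,\varepsilon}(\Theta)\},$$

where the infimum is taken over all possible hyperspherical angles $\Theta$. We know, according to technical Lemma 4.1, that for $\mu$-almost all $x$ and all $\varepsilon > 0$ small enough,

$$p_{x,\varepsilon} \geq \frac{V_d}{2^{2d}}\, \varepsilon^d f(x) > 0. \tag{3.3}$$

Thus, in the rest of the proof, we fix such an $x$ and assume that $\varepsilon$ is small enough so that the inequalities above are satisfied.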

Let $X^*$ be defined as the intersection of the line $(x, X)$ with $\mathcal{B}_{x,\varepsilon}$, and let $\Theta^*$ be the (random) hyperspherical angle corresponding to $X^*$ (see Figure 3 for an example in dimension 2).

Figure 3: The ball $\mathcal{B}_{x,\varepsilon}(\Theta^*)$ in dimension 2.
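Denote by $N_{x,\varepsilon}(\Theta^*)$ the number of hyperplanes passing through $d$ out of the observations $X_1, \ldots, X_n$ and cutting the segment $\mathcal{S}_{x,\varepsilon}(\Theta^*)$. We have

$$
\begin{aligned}
P\Big\{\min_{i=1,\ldots,n \,:\, X_i \in \mathcal{B}^c_{x,\varepsilon}} \rho_n(x, X_i) < \gamma_n\Big\}
&\leq n\, P\{\rho_n(x, X^*) < \gamma_n\} \\
&\leq n\, P\{N_{x,\varepsilon}(\Theta^*) < \gamma_n\} \\
&\leq n\, P\left\{\frac{N^1_{x,\varepsilon}(\Theta^*) \cdots N^{2^{d-1}}_{x,\varepsilon}(\Theta^*)}{n^{2^{d-1}-d}} < \gamma_n\right\},
\end{aligned}
$$

where the last inequality follows from technical Lemma 4.2. Since a product of $2^{d-1}$ nonnegative factors can be smaller than $\gamma_n n^{2^{d-1}-d}$ only if at least one factor is smaller than $\gamma_n^{1/2^{d-1}} n^{1-d/2^{d-1}}$, we obtain

$$P\Big\{\min_{i=1,\ldots,n \,:\, X_i \in \mathcal{B}^c_{x,\varepsilon}} \rho_n(x, X_i) < \gamma_n\Big\} \leq n \sum_{j=1}^{2^{d-1}} P\Big\{N^j_{x,\varepsilon}(\Theta^*) < \gamma_n^{1/2^{d-1}}\, n^{1-d/2^{d-1}}\Big\}.$$

Clearly, conditionally on $\Theta^*$, each $N^j_{x,\varepsilon}(\Theta^*)$ satisfies

$$\mathrm{Binomial}(n, p_{x,\varepsilon}) \leq_{st} N^j_{x,\varepsilon}(\Theta^*)$$

and consequently, by inequality (3.3),

$$\mathrm{Binomial}\left(n, \frac{V_d}{2^{2d}}\, \varepsilon^d f(x)\right) \leq_{st} N^j_{x,\varepsilon}(\Theta^*).$$

Thus, for each $j = 1, \ldots, 2^{d-1}$, by Hoeffding's inequality for binomial random variables (Hoeffding [21]), we are led to

$$
\begin{aligned}
P\Big\{N^j_{x,\varepsilon}(\Theta^*) < \gamma_n^{1/2^{d-1}}\, n^{1-d/2^{d-1}}\Big\}
&= E\Big[P\Big\{N^j_{x,\varepsilon}(\Theta^*) < \gamma_n^{1/2^{d-1}}\, n^{1-d/2^{d-1}} \,\Big|\, \Theta^*\Big\}\Big] \\
&\leq \exp\left(-\frac{2}{n}\left[n\,\frac{V_d}{2^{2d}}\, \varepsilon^d f(x) - \gamma_n^{1/2^{d-1}}\, n^{1-d/2^{d-1}}\right]^2\right)
\end{aligned}
$$

as soon as $\gamma_n^{1/2^{d-1}}\, n^{1-d/2^{d-1}} < n\,\frac{V_d}{2^{2d}}\, \varepsilon^d f(x)$. Therefore, taking

$$\gamma_n = n^d \left(\frac{V_d}{2^{2d+1}}\, \varepsilon^d f(x)\right)^{2^{d-1}},$$

we obtain

$$P\Big\{\min_{i=1,\ldots,n \,:\, X_i \in \mathcal{B}^c_{x,\varepsilon}} \rho_n(x, X_i) < \gamma_n\Big\} \leq 2^{d-1}\, n \exp\left(-\frac{n\,(V_d\, \varepsilon^d f(x))^2}{2^{4d+1}}\right).$$

The upper bound goes to 0 as $n \to \infty$. ■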

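Analysis of B. Consistency of the second term in inequality (3.2) is established in the following proposition.

Proposition 3.2 For $\mu$-almost all $x \in \mathbb{R}^d$, all $\varepsilon > 0$ and all $\delta > 0$ small enough,

$$P\Big\{\max_{i=1,\ldots,n \,:\, X_i \in \mathcal{B}_{x,\delta}} \rho_n(x, X_i) \geq \gamma_n\Big\} \to 0 \quad \text{as } n \to \infty,$$

provided

$$\gamma_n = n^d \left(\frac{V_d}{2^{2d+1}}\, \varepsilon^d f(x)\right)^{2^{d-1}}. \tag{3.4}$$

Proof of Proposition 3.2 Fix $x$ in a set of $\mu$-measure 1 such that $f(x) > 0$ and denote by $N_{x,\delta}$ the number of hyperplanes that cut the ball $\mathcal{B}_{x,\delta}$. Clearly,

$$P\Big\{\max_{i=1,\ldots,n \,:\, X_i \in \mathcal{B}_{x,\delta}} \rho_n(x, X_i) \geq \gamma_n\Big\} \leq P\{N_{x,\delta} \geq \gamma_n\}.$$

Observe that, with probability 1,

$$N_{x,\delta} = \sum_{1 \leq i_1 < \cdots < i_d \leq n} \mathbf{1}_{\{\mathcal{H}(X_{i_1}, \ldots, X_{i_d}) \cap \mathcal{B}_{x,\delta} \neq \emptyset\}},$$

whence, since $X_1, \ldots, X_n$ are identically distributed,

$$E[N_{x,\delta}] = \binom{n}{d} P\{\mathcal{H}(X_1, \ldots, X_d) \cap \mathcal{B}_{x,\delta} \neq \emptyset\} \leq \frac{n^d}{d!}\, P\{\mathcal{H}(X_1, \ldots, X_d) \cap \mathcal{B}_{x,\delta} \neq \emptyset\}.$$

Consequently, given the choice (3.4) for $\gamma_n$ and the result of technical Lemma 4.3, it follows that

$$E[N_{x,\delta}] \leq \gamma_n/2$$

for all $\delta$ small enough, independently of $n$. Replacing a single observation changes $N_{x,\delta}$ by at most the number of hyperplanes passing through that observation, which is at most $n^{d-1}$. Thus, using the bounded difference inequality (McDiarmid [23]), we obtain, still with the choice (3.4),

$$
\begin{aligned}
P\{N_{x,\delta} \geq \gamma_n\} &\leq P\{N_{x,\delta} - E[N_{x,\delta}] \geq \gamma_n/2\} \\
&\leq \exp\left(-\frac{2\,(\gamma_n/2)^2}{n \cdot n^{2(d-1)}}\right) \\
&= \exp\left(-\frac{n}{2}\left(\frac{V_d}{2^{2d+1}}\, \varepsilon^d f(x)\right)^{2^d}\right).
\end{aligned}
$$

This upper bound goes to zero as $n$ tends to infinity, and this concludes the proof of the proposition. ■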



Analysis of C. To complete the proof of identity (3.1), it remains to show that the third and last term of (3.2) converges to 0. This is done in the following proposition.

Proposition 3.3 Assume that $k_n/n \to 0$ as $n \to \infty$. Then, for $\mu$-almost all $x \in \mathbb{R}^d$ and all $\delta > 0$,

$$P\big\{\mathrm{Card}\{i = 1, \ldots, n : \|X_i - x\| \leq \delta\} < k_n\big\} \to 0 \quad \text{as } n \to \infty.$$

Proof of Proposition 3.3 Recall that the collection of all $x$ with $\mu(\mathcal{B}_{x,\tau}) > 0$ for all $\tau > 0$ is called the support of $\mu$, and note that it may alternatively be defined as the smallest closed subset of $\mathbb{R}^d$ of $\mu$-measure 1 (Parthasarathy [32, Chapter 2]). Thus, fix $x$ in the support of $\mu$ and set

$$p_{x,\delta} = \mu(\mathcal{B}_{x,\delta}),$$

so that $p_{x,\delta} > 0$. Then the following chain of inequalities is valid:

$$
\begin{aligned}
P\big\{\mathrm{Card}\{i = 1, \ldots, n : \|X_i - x\| \leq \delta\} < k_n\big\}
&= P\{\mathrm{Binomial}(n, p_{x,\delta}) < k_n\} \\
&\leq P\{\mathrm{Binomial}(n, p_{x,\delta}) \leq n p_{x,\delta}/2\} \\
&\qquad \text{(for all $n$ large enough, since $k_n/n$ tends to 0)} \\
&\leq \exp(-n p_{x,\delta}^2/2),
\end{aligned}
$$

where the last inequality follows from Hoeffding's inequality (Hoeffding [21]). This terminates the proof of Proposition 3.3. ■
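For the reader's convenience, the last step can be written out. The display below is a worked instance of the one-sided Hoeffding bound for a sum of $n$ independent $\{0,1\}$-valued variables, namely $P\{S - np \leq -t\} \leq \exp(-2t^2/n)$ for $t > 0$, applied with $t = n p_{x,\delta}/2$.

```latex
\begin{align*}
P\{\mathrm{Binomial}(n, p_{x,\delta}) \le n p_{x,\delta}/2\}
  &= P\{\mathrm{Binomial}(n, p_{x,\delta}) - n p_{x,\delta} \le -n p_{x,\delta}/2\} \\
  &\le \exp\!\left(-\frac{2\,(n p_{x,\delta}/2)^2}{n}\right)
   = \exp\!\left(-\frac{n\, p_{x,\delta}^2}{2}\right).
\end{align*}
```

Since $p_{x,\delta} > 0$ does not depend on $n$, the bound decays exponentially fast in $n$.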
We have proved so far that, for $\mu$-almost all $x$, as $k_n/n \to 0$, the quantity $\max_{i=1,\ldots,k_n} \|X_{(i)}(x) - x\|$ converges to 0 in probability. By the elementary inequality

$$E\left[\frac{1}{k_n} \sum_{i=1}^{k_n} \mathbf{1}_{\{\|X_{(i)}(x) - x\| > \varepsilon\}}\right] \leq P\Big\{\max_{i=1,\ldots,k_n} \|X_{(i)}(x) - x\| > \varepsilon\Big\},$$

it immediately follows that, for such an $x$,

$$E\left[\frac{1}{k_n} \sum_{i=1}^{k_n} \mathbf{1}_{\{\|X_{(i)}(x) - x\| > \varepsilon\}}\right] \to 0 \tag{3.5}$$

provided $k_n/n \to 0$. We are now ready to complete the proof of Theorem 2.1.

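Fix $x$ in a set of $\mu$-measure 1 such that consistency (3.5) holds and $r$ is continuous at $x$ (this is possible by the assumption on $r$). Because $|a + b|^p \leq 2^{p-1}(|a|^p + |b|^p)$ for $p \geq 1$, we see that

$$E|r_n(x) - r(x)|^p \leq 2^{p-1} E\left|\frac{1}{k_n} \sum_{i=1}^{k_n} \big[Y_{(i)}(x) - r(X_{(i)}(x))\big]\right|^p + 2^{p-1} E\left|\frac{1}{k_n} \sum_{i=1}^{k_n} \big[r(X_{(i)}(x)) - r(x)\big]\right|^p.$$

Thus, by Jensen's inequality,

$$
\begin{aligned}
E|r_n(x) - r(x)|^p &\leq 2^{p-1} E\left|\frac{1}{k_n} \sum_{i=1}^{k_n} \big[Y_{(i)}(x) - r(X_{(i)}(x))\big]\right|^p + 2^{p-1} E\left[\frac{1}{k_n} \sum_{i=1}^{k_n} \big|r(X_{(i)}(x)) - r(x)\big|^p\right] \\
&:= 2^{p-1} I_n + 2^{p-1} J_n.
\end{aligned}
$$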



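Firstly, for arbitrary $\varepsilon > 0$, we have

$$J_n = E\left[\frac{1}{k_n} \sum_{i=1}^{k_n} \big|r(X_{(i)}(x)) - r(x)\big|^p \mathbf{1}_{\{\|X_{(i)}(x) - x\| > \varepsilon\}}\right] + E\left[\frac{1}{k_n} \sum_{i=1}^{k_n} \big|r(X_{(i)}(x)) - r(x)\big|^p \mathbf{1}_{\{\|X_{(i)}(x) - x\| \leq \varepsilon\}}\right],$$

whence

$$J_n \leq 2^p \zeta^p\, E\left[\frac{1}{k_n} \sum_{i=1}^{k_n} \mathbf{1}_{\{\|X_{(i)}(x) - x\| > \varepsilon\}}\right] + \left[\sup_{y \,:\, \|y - x\| \leq \varepsilon} |r(y) - r(x)|\right]^p$$

(since $|Y| \leq \zeta$).

The first term on the right-hand side of the latter inequality tends to 0 by (3.5) as $k_n/n \to 0$, whereas the rightmost one can be made arbitrarily small as $\varepsilon \to 0$ since $r$ is continuous at $x$. This proves that $J_n \to 0$ as $n \to \infty$.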

Next, by successive applications of inequalities of Marcinkiewicz and Zygmund [22] (see also Petrov [33, pages 59-60]), we have for some positive constant $C_p$ depending only on $p$,

$$I_n \leq C_p\, E\left[\left(\frac{1}{k_n^2} \sum_{i=1}^{k_n} \big|Y_{(i)}(x) - r(X_{(i)}(x))\big|^2\right)^{p/2}\right] \leq \frac{(2\zeta)^p\, C_p}{k_n^{p/2}}$$

(since $|Y| \leq \zeta$).

Consequently, $I_n \to 0$ as $k_n \to \infty$, and this concludes the proof of the theorem.



4 Some technical lemmas 

The notation of this section is identical to that of Section 3. In particular, it is assumed throughout that $X$ has a probability density $f$ with respect to the Lebesgue measure $\lambda$ on $\mathbb{R}^d$. This requirement implies that any collection $X_{i_1}, \ldots, X_{i_d}$ ($1 \leq i_1 < i_2 < \cdots < i_d \leq n$) of $d$ points among $X_1, \ldots, X_n$ defines with probability 1 a unique hyperplane $\mathcal{H}(X_{i_1}, \ldots, X_{i_d})$ in $\mathbb{R}^d$. Recall finally that, for $x \in \mathbb{R}^d$ and $\varepsilon > 0$, we set

$$p_{x,\varepsilon} = \min_{j=1,\ldots,2^{d-1}} \inf_{\Theta} \mu\{\mathcal{R}^j_{x,\varepsilon}(\Theta)\},$$

where the infimum is taken over all possible hyperspherical angles $\Theta$, and the regions $\mathcal{R}^j_{x,\varepsilon}(\Theta)$, $j = 1, \ldots, 2^{d-1}$, define a partition of the ball $\mathcal{B}_{x,\varepsilon}(\Theta)$. Recall also that the numbers of sample points falling in each of these regions are denoted by $N^1_{x,\varepsilon}(\Theta), \ldots, N^{2^{d-1}}_{x,\varepsilon}(\Theta)$. For a better understanding of the next lemmas, the reader should refer to Figure 2 and Figure 3.
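Lemma 4.1 For $\mu$-almost all $x \in \mathbb{R}^d$ and all $\varepsilon > 0$ small enough,

$$p_{x,\varepsilon} \geq \frac{V_d}{2^{2d}}\, \varepsilon^d f(x) > 0.$$

Proof of Lemma 4.1 We let $x$ be a Lebesgue point of $f$, that is, an $x$ such that for any collection $\mathcal{A}$ of subsets of $\mathcal{B}_{0,1}$ with the property that for all $A \in \mathcal{A}$, $\lambda(A) \geq c\lambda(\mathcal{B}_{0,1})$ for some fixed $c > 0$,

$$\lim_{\varepsilon \to 0} \sup_{A \in \mathcal{A}} \left|\frac{\int_{x + \varepsilon A} f(y)\,dy}{\lambda\{x + \varepsilon A\}} - f(x)\right| = 0, \tag{4.1}$$

where $x + \varepsilon A = \{y \in \mathbb{R}^d : (y - x)/\varepsilon \in A\}$. As $f$ is a density, we know that $\mu$-almost all $x$ satisfy this property (see, for instance, Wheeden and Zygmund [39]). Moreover, since $f$ is $\mu$-almost surely positive, we may also assume that $f(x) > 0$.

Thus, keep such an $x$ fixed. Fix also $j \in \{1, \ldots, 2^{d-1}\}$, and set

$$p^j_{x,\varepsilon} = \inf_{\Theta} \mu\{\mathcal{R}^j_{x,\varepsilon}(\Theta)\}.$$

Taking for $\mathcal{A}$ the collection of regions $\mathcal{R}^j_{0,1}(\Theta)$ when the hyperspherical angle $\Theta$ varies, that is,

$$\mathcal{A} = \big\{\mathcal{R}^j_{0,1}(\Theta) : \Theta \in [0, \pi]^{d-2} \times [0, 2\pi)\big\},$$

and observing that

$$\lambda\{\mathcal{R}^j_{x,\varepsilon}(\Theta)\} = \frac{V_d}{2^{d-1}} \left(\frac{\varepsilon}{2}\right)^d,$$

we may write, for each $j = 1, \ldots, 2^{d-1}$,

$$
\begin{aligned}
\left|\frac{2^{d-1}\, p^j_{x,\varepsilon}}{V_d\, (\varepsilon/2)^d} - f(x)\right|
&= \left|\inf_{\Theta} \frac{\mu\{\mathcal{R}^j_{x,\varepsilon}(\Theta)\}}{\lambda\{\mathcal{R}^j_{x,\varepsilon}(\Theta)\}} - f(x)\right| \\
&= \left|\inf_{A \in \mathcal{A}} \frac{\int_{x + \varepsilon A} f(y)\,dy}{\lambda\{x + \varepsilon A\}} - f(x)\right| \\
&\leq \sup_{A \in \mathcal{A}} \left|\frac{\int_{x + \varepsilon A} f(y)\,dy}{\lambda\{x + \varepsilon A\}} - f(x)\right|.
\end{aligned}
$$

The conclusion follows from identity (4.1): for all $\varepsilon$ small enough, the left-hand side is at most $f(x)/2$, so that $p^j_{x,\varepsilon} \geq \frac{1}{2} f(x)\, \frac{V_d}{2^{d-1}} (\varepsilon/2)^d = \frac{V_d}{2^{2d}}\, \varepsilon^d f(x)$ for each $j$.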
Lemma 4.2 Fix $x \in \mathbb{R}^d$, $\varepsilon > 0$ and $\Theta \in [0, \pi]^{d-2} \times [0, 2\pi)$. Let $N_{x,\varepsilon}(\Theta)$ be the number of hyperplanes passing through $d$ out of the observations $X_1, \ldots, X_n$ and cutting the segment $\mathcal{S}_{x,\varepsilon}(\Theta)$. Then, with probability 1,

$$N_{x,\varepsilon}(\Theta) \geq \frac{N^1_{x,\varepsilon}(\Theta) \cdots N^{2^{d-1}}_{x,\varepsilon}(\Theta)}{n^{2^{d-1}-d}}.$$

Proof of Lemma 4.2 If one of the $N^j_{x,\varepsilon}(\Theta)$ ($j = 1, \ldots, 2^{d-1}$) is zero, then the result is trivial. Thus, in the rest of the proof, we suppose that each $N^j_{x,\varepsilon}(\Theta)$ is positive and note that this implies $n \geq 2^{d-1}$.

Pick sequentially $2^{d-1}$ observations, say $X_{i_1}, \ldots, X_{i_{2^{d-1}}}$, in the $2^{d-1}$ regions $\mathcal{R}^1_{x,\varepsilon}(\Theta), \ldots, \mathcal{R}^{2^{d-1}}_{x,\varepsilon}(\Theta)$. By construction, the polytope defined by these $2^{d-1}$ points cuts the axis $\mathcal{L}_x(\Theta)$. Consequently, with probability 1, any hyperplane drawn according to $d$ out of these $2^{d-1}$ points cuts the segment $\mathcal{S}_{x,\varepsilon}(\Theta)$. The result follows by observing that there are exactly $N^1_{x,\varepsilon}(\Theta) \cdots N^{2^{d-1}}_{x,\varepsilon}(\Theta)$ such polytopes. ■

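Lemma 4.3 For $1 \leq i_1 < \cdots < i_d \leq n$, let $\mathcal{H}(X_{i_1}, \ldots, X_{i_d})$ be the hyperplane passing through $d$ out of the observations $X_1, \ldots, X_n$. Then, for all $x \in \mathbb{R}^d$,

$$P\{\mathcal{H}(X_{i_1}, \ldots, X_{i_d}) \cap \mathcal{B}_{x,\delta} \neq \emptyset\} \to 0 \quad \text{as } \delta \downarrow 0.$$

Proof of Lemma 4.3 Given two hyperplanes $\mathcal{H}$ and $\mathcal{H}'$ in $\mathbb{R}^d$, we denote by $\Phi(\mathcal{H}, \mathcal{H}')$ the (dihedral) angle between $\mathcal{H}$ and $\mathcal{H}'$. Recall that $\Phi(\mathcal{H}, \mathcal{H}') \in [0, \pi]$ and that it is defined as the angle between the corresponding normal vectors.

Fix $1 \leq i_1 < \cdots < i_d \leq n$. Let $\mathcal{E}_\delta$ be the event

$$\mathcal{E}_\delta = \{\|X_{i_j} - x\| > \delta : j = 1, \ldots, d-1\},$$

and let $\mathcal{H}(x, X_{i_1}, \ldots, X_{i_{d-1}})$ be the hyperplane passing through $x$ and the $d-1$ points $X_{i_1}, \ldots, X_{i_{d-1}}$. Clearly, on $\mathcal{E}_\delta$, the event $\{\mathcal{H}(X_{i_1}, \ldots, X_{i_d}) \cap \mathcal{B}_{x,\delta} \neq \emptyset\}$ is the same as

$$\big\{\Phi\big(\mathcal{H}(x, X_{i_1}, \ldots, X_{i_{d-1}}), \mathcal{H}(X_{i_1}, \ldots, X_{i_d})\big) \leq \Phi_\delta\big\},$$

where $\Phi_\delta$ is the angle formed by $\mathcal{H}(x, X_{i_1}, \ldots, X_{i_{d-1}})$ and the hyperplane going through $X_{i_1}, \ldots, X_{i_{d-1}}$ and tangent to $\mathcal{B}_{x,\delta}$ (see Figure 4 for an example in dimension 2).

Thus, with this notation, we may write

$$
\begin{aligned}
P\{\mathcal{H}(X_{i_1}, \ldots, X_{i_d}) \cap \mathcal{B}_{x,\delta} \neq \emptyset\}
&\leq P\{\mathcal{E}^c_\delta\} + P\{\mathcal{H}(X_{i_1}, \ldots, X_{i_d}) \cap \mathcal{B}_{x,\delta} \neq \emptyset, \mathcal{E}_\delta\} \\
&\leq P\{\mathcal{E}^c_\delta\} + P\big\{\Phi\big(\mathcal{H}(x, X_{i_1}, \ldots, X_{i_{d-1}}), \mathcal{H}(X_{i_1}, \ldots, X_{i_d})\big) \leq \Phi_\delta, \mathcal{E}_\delta\big\}.
\end{aligned}
$$

Figure 4: The hyperplanes $\mathcal{H}(x, X_{i_1}, \ldots, X_{i_{d-1}})$ and $\mathcal{H}(X_{i_1}, \ldots, X_{i_d})$, and the angle $\Phi_\delta$. Illustration in dimension 2.

Since $X$ has a density, the first of the two terms above tends to zero as $\delta \downarrow 0$. To analyze the second term, first note that, conditionally on $X_{i_1}, \ldots, X_{i_{d-1}}$, the angle $\Phi(\mathcal{H}(x, X_{i_1}, \ldots, X_{i_{d-1}}), \mathcal{H}(X_{i_1}, \ldots, X_{i_d}))$ is absolutely continuous with respect to the Lebesgue measure on $\mathbb{R}$. This follows from the following two observations: (i) the random variable $X_{i_d}$ has a density with respect to the Lebesgue measure on $\mathbb{R}^d$, and (ii) conditionally on $X_{i_1}, \ldots, X_{i_{d-1}}$, $\Phi(\mathcal{H}(x, X_{i_1}, \ldots, X_{i_{d-1}}), \mathcal{H}(X_{i_1}, \ldots, X_{i_d}))$ is obtained from $X_{i_d}$ via translations, orthogonal transformations, and the arctan function.

Thus, writing

$$
\begin{aligned}
&P\big\{\Phi\big(\mathcal{H}(x, X_{i_1}, \ldots, X_{i_{d-1}}), \mathcal{H}(X_{i_1}, \ldots, X_{i_d})\big) \leq \Phi_\delta, \mathcal{E}_\delta\big\} \\
&\quad = E\Big[\mathbf{1}_{\mathcal{E}_\delta}\, P\big\{\Phi\big(\mathcal{H}(x, X_{i_1}, \ldots, X_{i_{d-1}}), \mathcal{H}(X_{i_1}, \ldots, X_{i_d})\big) \leq \Phi_\delta \,\big|\, X_{i_1}, \ldots, X_{i_{d-1}}\big\}\Big]
\end{aligned}
$$

and noting that, on the event $\mathcal{E}_\delta$, for fixed $X_{i_1}, \ldots, X_{i_{d-1}}$, $\Phi_\delta \downarrow 0$ as $\delta \downarrow 0$, we conclude by the Lebesgue dominated convergence theorem that

$$P\big\{\Phi\big(\mathcal{H}(x, X_{i_1}, \ldots, X_{i_{d-1}}), \mathcal{H}(X_{i_1}, \ldots, X_{i_d})\big) \leq \Phi_\delta, \mathcal{E}_\delta\big\} \to 0 \quad \text{as } \delta \downarrow 0.$$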